Profile-Based Focused Crawling for Social Media-Sharing Websites

Authors

  • Zhiyong Zhang
  • Olfa Nasraoui
Abstract

We present a novel profile-based focused crawling system for the increasingly popular social media-sharing websites. The system treats user profiles as ranking criteria that guide the crawling process. We divide a user’s profile into two parts: an internal part, which comes from the user’s own contributions, and an external part, which comes from the user’s social contacts. To expand the crawling topic, we adopt a cotagging topic-discovery scheme for social media-sharing websites. To extract data efficiently and effectively for the focused crawl, we first develop a path string-based page classification method that identifies list pages, detail pages, and profile pages. Identifying the page type correctly is essential, because each type yields different information, and that information is then used to estimate a reasonable ranking for each link encountered while crawling. Our experiments demonstrate the robustness of our profile-based focused crawler, as well as a significant improvement in harvest ratio over breadth-first and online page importance computation (OPIC) crawlers, when crawling the Flickr website for two different topics.


Similar resources

Personalized Recommendation in Social Media: a Profile Expansion Approach

With the success of Web 2.0 applications, various social media websites have been established and become tremendous assets for supporting critical business intelligence applications. The knowledge gained from social media websites can not only meet the objectives of businesses offering them but also help the development of novel and effective services that are better tailored to users’ needs. I...


Accurate and Efficient Crawling for Relevant Websites

Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant webpages, there are various applications which target whole websites instead of single webpages. For example, companies are represented by websites, not by individual webpages. To answer queries targeted at websites, web directories are...


Efficient Social Website Crawling Using Cluster Graph ; CU-CS-1056-09

Online social communities have gained significant popularity in recent years and have become an area of active research. Compared with general websites or well-structured Web forums, user-centered social websites pose several unique challenges for crawling, a fundamental task for data collection and data mining of large-scale online social communities: (1) Social websites have more complex link...


Efficient Social Website Crawling Using Cluster Graph

Online social communities have gained significant popularity in recent years and have become an area of active research. Compared with general websites or well-structured Web forums, user-centered social websites pose several unique challenges for crawling, a fundamental task for data collection and data mining of large-scale online social communities: (1) Social websites have more complex link...


Social Sharing Behavior Under E-Commerce Context

In the era of Web 2.0, social networking sites play an important role in generating online information. People create billions of shares such as web pages or videos with friends on these sites every month. The share on the social networking site is visible to all of her or his friends and could be clicked by them and generates traffic back to the website where the information is from. In order ...



Journal title:
  • EURASIP J. Image and Video Processing

Volume 2009


Publication date: 2009